Worksheet 2: Exploring Two Dimensional Data

This worksheet covers material from the second half of Module 1 - Exploratory Data Analysis in Two Dimensions. It should take no more than 20-30 minutes to complete. Please raise your hand if you get stuck.

There are many ways to accomplish the tasks you are presented with; however, by using the techniques covered in class, you will find the exercises relatively simple.

Import the Libraries

For this exercise, we will be using pandas, numpy, and matplotlib.


In [3]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
plt.style.use('ggplot')
%matplotlib inline

Exercise 1: Reading various forms of JSON Data

In the /data/ folder, you will find a series of .json files called dataN.json, numbered 1-4. Each file contains the following data:

birthday first_name last_name
0 5/3/67 Robert Hernandez
1 8/4/84 Steve Smith
2 9/13/91 Anne Raps
3 4/15/75 Alice Muller

Using the .read_json() function and its various configuration options, read each of these files into a DataFrame. The documentation is available here: http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html.
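The `orient` parameter tells `read_json` what shape of JSON to expect. A minimal sketch of three common orients, using inline JSON strings via `StringIO` rather than the worksheet's data files (the column values here are illustrative):

```python
import io
import pandas as pd

# 'records': a list of row objects
records_json = '[{"first_name": "Robert"}, {"first_name": "Steve"}]'
df_records = pd.read_json(io.StringIO(records_json), orient='records')

# 'index': an object keyed by row label
index_json = '{"0": {"first_name": "Robert"}, "1": {"first_name": "Steve"}}'
df_index = pd.read_json(io.StringIO(index_json), orient='index')

# 'split': separate columns/index/data arrays
split_json = ('{"columns": ["first_name"], "index": [0, 1],'
              ' "data": [["Robert"], ["Steve"]]}')
df_split = pd.read_json(io.StringIO(split_json), orient='split')

print(df_records.equals(df_split))
```

All three strings encode the same frame; picking the wrong `orient` typically raises an error or produces a transposed or mislabeled result, which is why data1.json through data4.json each need a different setting.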


In [27]:
df1 = pd.read_json('../../data/data1.json')
df1


Out[27]:
birthday first_name last_name
0 5/3/67 Robert Hernandez
1 8/4/84 Steve Smith
2 9/13/91 Anne Raps
3 4/15/75 Alice Muller

In [8]:
df2 = pd.read_json('../../data/data2.json', orient='index')
df2


Out[8]:
birthday first_name last_name
0 5/3/67 Robert Hernandez
1 8/4/84 Steve Smith
2 9/13/91 Anne Raps
3 4/15/75 Alice Muller

In [30]:
df3 = pd.read_json('../../data/data3.json', orient='columns')
df3


Out[30]:
birthday first_name last_name
0 5/3/67 Robert Hernandez
1 8/4/84 Steve Smith
2 9/13/91 Anne Raps
3 4/15/75 Alice Muller

In [31]:
df4 = pd.read_json('../../data/data4.json', orient='split')
df4


Out[31]:
first_name last_name birthday
0 Robert Hernandez 5/3/67
1 Steve Smith 8/4/84
2 Anne Raps 9/13/91
3 Alice Muller 4/15/75

Exercise 2:

In the data folder, there is a web server log file called hackers-access.httpd. For this exercise, you will use this file to answer the following questions:

  1. Which are the top 10 most used browsers in this data?
  2. Which are the top 10 most used operating systems?

In order to accomplish this task, do the following:

  1. Write a function which takes a User-Agent string as an argument and returns the relevant data. HINT: You might want to use Python's user_agents module, the documentation for which is available here: (https://pypi.python.org/pypi/user-agents)
  2. Next, apply this function to the column which contains the User-Agent string.
  3. Store this series as a new column in the DataFrame.
  4. Count the occurrences of each value in the new column.
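The apply/value_counts pattern in steps 2-4 can be sketched with a toy parser (a hypothetical stand-in; the real exercise uses `user_agents.parse` on the log data):

```python
import pandas as pd

# Toy stand-in for a User-Agent parser: grab the token before the first '/'
def get_product(ua_string):
    return ua_string.split('/')[0]

df = pd.DataFrame({'user_agent': ['Mozilla/5.0', 'curl/7.68.0', 'Mozilla/4.0']})

# Steps 2-3: apply the function and store the result as a new column
df['product'] = df['user_agent'].apply(get_product)

# Step 4: count occurrences of each value
counts = df['product'].value_counts()
print(counts)
```

`value_counts()` returns the counts sorted descending, so chaining `.head(10)` onto it gives the top 10 directly.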

In [6]:
import apache_log_parser
from user_agents import parse

# Build a parser for the Apache combined log format
line_parser = apache_log_parser.make_parser("%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"")

# Read in and parse each line of the log file
parsed_server_data = []
with open("../../data/hackers-access.httpd", "r") as server_log:
    for line in server_log:
        parsed_server_data.append(line_parser(line))

server_df = pd.DataFrame(parsed_server_data)
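The list-of-dicts construction used above is a general pandas pattern worth noting: each parsed log line becomes a dict, and `pd.DataFrame` turns the list into one row per dict. A minimal self-contained sketch with made-up log fields:

```python
import pandas as pd

# Each parsed line becomes a dict; pandas turns the list into rows,
# with dict keys becoming column names
rows = [
    {'remote_host': '10.0.0.1', 'status': '200'},
    {'remote_host': '10.0.0.2', 'status': '404'},
]
df = pd.DataFrame(rows)
print(df.shape)
```

Missing keys in any dict simply become NaN in that row, which is convenient for semi-structured data like log lines.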

In [11]:
#Write the functions
def get_os(x):
    user_agent = parse(x)
    return user_agent.os.family

def get_browser(x):
    user_agent = parse(x)
    return user_agent.browser.family

In [12]:
#Apply the functions to the dataframe
server_df['os'] = server_df['request_header_user_agent'].apply( get_os )
server_df['browser'] = server_df['request_header_user_agent'].apply( get_browser )

In [13]:
#Get the top 10 values
server_df['os'].value_counts().head(10)


Out[13]:
Windows 7        2041
Windows Vista     500
Windows XP        423
Windows 8.1       221
Linux             125
Mac OS X           80
Chrome OS          60
Ubuntu              6
Name: os, dtype: int64

In [14]:
server_df['browser'].value_counts().head(10)


Out[14]:
Firefox    1476
Chrome     1322
Opera       556
IE           78
Safari       24
Name: browser, dtype: int64

Exercise 3:

Using the dailybots.csv file, read the file into a DataFrame and perform the following operations:

  1. Filter the DataFrame to include bots from the Government/Politics Industry.
  2. Calculate the ratio of hosts to orgs and add this as a column to the DataFrame and output the result
  3. Calculate the total number of hosts infected by each BotFam in the Government/Politics Industry. You should use the groupby() function which is documented here: (http://pandas.pydata.org/pandas-docs/stable/groupby.html)

In [20]:
bots = pd.read_csv('../../data/dailybots.csv')
bots.head()


Out[20]:
date botfam industry hosts orgs
0 2016-06-01 Bedep Education 88 33
1 2016-06-01 Bedep Finance 387 17
2 2016-06-01 Bedep Government/Politics 430 42
3 2016-06-01 Bedep Healthcare/Wellness 42 19
4 2016-06-01 Bedep Manufacturing 184 18

In [24]:
gov_bots = bots.loc[bots['industry'] == "Government/Politics", ['botfam', 'hosts']]

In [26]:
gov_bots.groupby('botfam', as_index=False).sum()


Out[26]:
botfam hosts
0 Bedep 6183
1 ConfickerAB 28818
2 Necurs 12425
3 Olmasco 83
4 PushDo 2164
5 Ramnit 4484
6 Sality 7466
7 Zeus 965
8 Zusy 227
9 zeroaccess 886
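Step 2 of the exercise (the hosts-to-orgs ratio column) is not shown in the cells above; a minimal sketch of that step, using a small hand-built frame in place of dailybots.csv (the values here are illustrative, not from the real file):

```python
import pandas as pd

# Stand-in rows mirroring the dailybots.csv columns
bots = pd.DataFrame({
    'botfam': ['Bedep', 'Necurs'],
    'hosts': [430, 120],
    'orgs': [42, 30],
})

# Ratio of hosts to orgs, added as a new column (elementwise division)
bots['host_org_ratio'] = bots['hosts'] / bots['orgs']
print(bots)
```

Dividing two columns operates elementwise and aligns on the index, so no loop is needed.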